Genome Medicine
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Genome Medicine's content profile, based on 154 papers previously published here. The average preprint has a 0.32% match score for this journal, so anything above that is already an above-average fit.
Foguet, C.; Gil, L.; Xu, Y.; Salazar-Magana, S.; Ritchie, S. C.; Persyn, E.; Im, H. K.; Inouye, M.; Lambert, S. A.
Genetic prediction of multi-omic data has emerged as a cost-effective alternative to direct omics profiling, particularly useful for identifying molecular features associated with disease susceptibility. However, despite their popularity, multi-omic imputation models are fragmented across studies, hindering findability, accessibility, interoperability and re-use. To address this, we developed OmicsPred (https://www.omicspred.org), a centralised platform for the deposition and dissemination of genetic prediction models of multi-omic traits. OmicsPred unifies the most commonly used molecular imputation models (e.g. from PredictDB) and other published studies totalling 3,339,469 prediction models spanning transcriptomic, proteomic, and metabolomic traits (as of May 2026). Each model is accompanied by metadata describing score development and predictive performance, and distributed in formats compatible with popular analytic tools, such as PGS Catalog Calculator and MetaXcan. To demonstrate the utility of the resource for systematic target discovery, we perform a multi-omic phenome-wide association analysis in Million Veteran Program data.
Shang, Y.; Badonyi, M.; Marsh, J. A.
Interpreting the clinical significance of missense variants of uncertain significance (VUS) remains a major challenge in clinical genetics. Although computational variant effect predictors (VEPs) and multiplexed assays of variant effect (MAVEs) can generate large-scale functional scores, their value is typically assessed using discrimination metrics such as AUROC rather than by the strength of evidence they provide under ACMG/AMP guidelines. Here, we introduce mean evidence strength (MES), a quantitative metric that summarises the pathogenic and benign evidence assigned across missense variants following gene-level Bayesian calibration. Using the acmgscaler framework, we calibrated 12 population-free VEPs across 367 disease genes and analysed 15 MAVE datasets with sufficient clinical data. MES revealed important discrepancies with AUROC, including cases where methods with similar discrimination differed substantially in evidence yield. MAVEs achieved high average MES despite lower AUROC, while several VEPs showed strong discrimination but more limited calibrated evidence. Among predictors, CPT-1 achieved the highest MES and provided moderate or stronger evidence for the largest fraction of ClinVar VUS. MES therefore provides a practical framework for evaluating computational and experimental variant effect datasets in terms of calibrated clinical evidence yield.
Rowlands, C. F.; Choi, S.; Allen, S.; Kuzbari, Z.; Cubuk, C.; Sultana, R.; Torr, B.; Durkie, M.; Burghel, G. J.; Robinson, R.; Callaway, A.; Field, J.; Frugtniet, B.; Palmer-Smith, S.; Grant, J.; Pagan, J.; Johnston, E.; McDevitt, T.; Hughes, L.; Yarram-Smith, L.; Logan, P.; Reed, L.; Snape, K.; McVeigh, T.; Hanson, H.; Garrett, A.; Turnbull, C.; CanVIG-UK
Interpretation of germline variants in cancer susceptibility genes (CSGs) requires the collation of variant-level data from diverse sources, as well as the assembly of comprehensive clinical data, often necessitating sharing of information between genomic testing centers. Although a number of variant interpretation tools exist, there remains a need for a CSG-focused platform tailored to the diverse range of ClinGen variant curation expert panel guidance in these difficult-to-interpret genes. Here, we describe CanVar-UK, a freely-accessible web platform to assist in the interpretation of germline CSG variants. CanVar-UK contains variant-level data for over 1.7 million single nucleotide variants, comprising all coding variants in 115 established CSGs. These data include: in silico scores from 11 tools of clinical relevance; population allele frequencies from gnomAD v4.1 and case counts from NHS genomic testing via linkage to the National Disease Registration Service; variant-level readouts from 31 different functional and splicing studies across 13 CSGs; genetic epidemiology studies of the BRCA1/2 genes; and live linkage to existing consensus classifications in the ClinVar database. CanVar-UK additionally has a diagnostic discussion forum functionality, via which users are able to email the rest of the user base with queries and/or suggested classifications, facilitating the exchange of clinical and classification data between diagnostic centers. Already widely used by the NHS clinical workforce in the CSG space (with 879 registered NHS users), CanVar-UK has a rapidly growing international user base, with 607 registered users based outside the UK. We believe CanVar-UK to be an invaluable resource for germline CSG variant interpretation.
Schreiner, P. A.; Markianos, K.; Francis, M.; Despard, B.; Gorman, B. R.; Said, I.; Dong, F.; Gautam, S.; Dochtermann, D.; Shi, Y.; Devineni, P.; Kirkpatrick, C.; Khazanov, N.; Moser, J.; Million Veteran Program; Huang, G. D.; Muralidhar, S.; Tsao, P. S.; Pyarajan, S.
The Million Veteran Program (MVP) represents the largest and one of the most diverse single cohorts associated with longitudinal electronic health record (EHR) data. We profiled a subset of samples from MVP using the Illumina Infinium MethylationEPIC Beadchip (EPIC array) to generate one of the largest single-cohort methylation datasets to date. Methylation profiles were analyzed for 45,460 total individuals, with the most populous ancestries composed of 27,455 Europeans, 11,798 African Americans, and 4,859 Admixed Americans. We detail the strict quality control standards implemented to ensure the most robust method of methylation profiling of the MVP cohort. This dataset was then applied to evaluate the effects of smoking exposure on DNA methylation in MVP participants. Ancestry-stratified epigenome-wide association studies (EWAS) of smoking status (ever/never) were performed using over 750,000 probes with certifiable signal. Our multi-ancestry meta-analysis demonstrates replicability with existing EWAS and identifies 3,207 novel probe-smoking associations unlocked via the depth and breadth of data in this cohort.
Nazari, I.; Ennis, S.; Ashton, J.; Cheng, G.
Interpretation of rare-disease genomes remains constrained by variant-centric analytical frameworks that insufficiently capture the cumulative impact of multiple variants within a gene. GenePy provides an individual-level, gene-based burden metric that integrates variant consequence, allele frequency, and zygosity into a unified quantitative score, enabling a transition from discrete variant annotation to aggregated gene-level interpretation. In the context of Genomics England, this formulation supports a panel-agnostic, genotype-to-phenotype diagnostic strategy for unresolved monogenic disorders by prioritising genes with elevated mutational burden per individual. Here, we present a fully automated, containerised GenePy workflow deployed through Nextflow and integrated within the Genomics England (GEL) Research Environment via the Lifebit CloudOS platform. This implementation provides scalable, secure, and governance-compliant computation of gene-level burden scores across population-scale cohorts. The workflow harmonises variant annotation, quality control, and chunked data aggregation within modular, reproducible processes designed for high-throughput execution on cloud-native infrastructure. By enabling robust, portable, and auditable gene-level scoring across large rare-disease sequencing datasets, this framework enhances analytical resolution and supports downstream statistical prioritisation, integrative phenotype matching, and hypothesis generation within genotype-to-phenotype diagnostic workflows.
Glasenapp, M. R.; Yee, M.-C.; Symons, A. E.; Cornejo, O. E.; Garcia, O. A.
Accurate HLA typing is critical for transplantation, pharmacogenomics, and disease risk prediction, yet short-read approaches cannot resolve the HLA region's extreme polymorphism. Long-read sequencing improves resolution, but its adoption has been limited by higher cost, reduced base accuracy, limited throughput, and reliance on long-range PCR. To overcome these limitations, we present a multiplexed long-read hybrid capture workflow for PacBio and Oxford Nanopore sequencing that enriches all classical HLA loci and the complete HLA Class III region. A single-step enzymatic fragmentation and barcoding strategy enables automated library prep. We also introduce HLA-Resolve, an HLA typing program optimized for HiFi reads, and validate workflow performance against the Genome in a Bottle, Human Pangenome Reference Consortium, and International Histocompatibility Working Group benchmarks using 32 geographically diverse samples. These advances offer a cost-effective approach for high-resolution HLA typing with clinical applicability and enable investigation of the role of HLA Class III variation in disease.
Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group; ClinGen Variant Classification Working Group
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and a deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
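As a rough illustration of how a local posterior probability can be turned into an evidence level, the sketch below converts a posterior and a prior into a positive likelihood ratio and compares it against odds thresholds. The function name `evidence_level` is hypothetical. The default constant of 350 for "very strong" evidence, with each lower level taken as a successive square root, follows the Bayesian adaptation of the ACMG/AMP guidelines that is commonly quoted for a prior of 0.10; it is an assumption here, not the calibration reported in this preprint, which estimates its own priors and thresholds.

```python
def evidence_level(posterior, prior, o_pvst=350.0):
    """Map a local posterior probability of pathogenicity to an evidence label.

    o_pvst is the odds of pathogenicity assumed for "very strong" evidence
    (illustrative default; a real calibration derives it from the prior).
    """
    if not (0.0 < posterior < 1.0 and 0.0 < prior < 1.0):
        raise ValueError("posterior and prior must lie strictly between 0 and 1")

    # Likelihood ratio = posterior odds divided by prior odds.
    lr = (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))

    # Each weaker level is the square root of the one above it.
    levels = [
        ("very_strong", o_pvst),
        ("strong", o_pvst ** 0.5),
        ("moderate", o_pvst ** 0.25),
        ("supporting", o_pvst ** 0.125),
    ]
    for name, threshold in levels:
        if lr >= threshold:
            return "pathogenic_" + name
        if lr <= 1.0 / threshold:
            return "benign_" + name
    return "indeterminate"


if __name__ == "__main__":
    for p in (0.99, 0.5, 0.1, 0.01):
        print(p, evidence_level(p, prior=0.1))
```

A score threshold for a given tool would then be the score at which the estimated local posterior first crosses one of these levels; how that posterior is estimated is outside the scope of this sketch.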
Mehta, M.; Ahmed, K.; Hussein, R.; Tavares, E.; Berberovic, Z.; Adele, R.; D'Souza, A.; Gu, B.; Wilson, M. D.; Ivakine, E.; Monnier, P. P.; Heon, E.; Vincent, A.
Transgenic mouse models are indispensable for dissecting disease mechanisms; yet, their interpretability is frequently compromised by cryptic genomic alterations introduced during transgenesis. Thus, robust quality control strategies are needed to elucidate integration architecture and evaluate model performance when such unintended events occur. Here, we applied unbiased whole-genome long-read sequencing using the PacBio Revio to investigate a mouse model exhibiting unexpected transgene silencing, originally designed to recapitulate autosomal-dominant hereditary macular dystrophy driven by upregulation of a ZZEF1-ALOX15 fusion gene. Long-read sequencing analysis revealed a ≥29-kb head-to-tail concatemer containing more than three copies of the transgene vector. Reconstruction of transgene-genome junctions revealed off-target integration of the concatemer into the calcium-sensing receptor gene (Casr), along with exogenous E. coli DNA, that together defined final transgene architecture. 5-methylcytosine profiling identified hypermethylation of the transgene promoter and additional phenotyping indicated disruption of endogenous Casr function resulting from the rearrangement. Our workflow enabled direct detection of transgene concatenation and off-target mapping. These findings establish long-read sequencing as a powerful and scalable quality control standard for genetically engineered animal models, uniquely capable of uncovering hidden genomic complexity, resolving aberrant phenotypes, and enhancing the reliability of in vivo disease modelling.
Difilippo, V.; Saba, K. H.; Wallander, K.; Styring, E.; Nathrath, M.; Baumoer, D.; Haglund de Flon, F.; Nord, K. H.
To streamline molecular profiling of tumor biopsies, we developed SarcDBase, an openly accessible tool that extracts and interprets clinically relevant genetic alterations from next-generation sequencing data. By automatically linking identified variants to curated, user-defined reference lists, SarcDBase minimizes the need for specialized expertise and reduces the burden of manual data processing. The platform delivers detailed molecular profiles, diagnostic insights and an intuitive interface for comprehensive interpretation. SarcDBase's performance was evaluated in a heterogeneous cohort of 204 deep-sequenced bone and soft tissue tumors. In most cases (81%), its interpretation closely matched the curated post-sequencing diagnosis. Discrepancies mainly occurred in samples lacking diagnostically informative mutations. In some instances, SarcDBase flagged rare or unexpected alterations, including previously unreported gene fusions. This highlights SarcDBase's dual potential as both an interrogative research tool and facilitator of molecular diagnostics, especially for reclassification of diagnostically challenging tumor types.
Yu, J.; Darmofal, M.; Waters, M.; Choy, J.; Tran, T. N.; Fu, C.; Morales, L.; U, K.; Levine, R. L.; Schultz, N.; Berger, M. F.; Morris, Q.; Jee, J.
General-purpose large language models (LLMs) are trained on large corpora to acquire broad knowledge, but whether LLMs can replace, or augment, task-specific models is unclear. We evaluated LLMs on three real-world, clinically important tumor genomic interpretation tasks, in order of increasing difficulty: (i) distinguishing tumor from non-tumor mutations (n=34,415 variants), (ii) distinguishing driver from passenger mutations (n=13,469 variants), and (iii) inferring cancer type from tumor sequencing reports across multiple assays and institutions (n=102,791 samples). The best general-purpose LLMs performed as well as the benchmark tailor-made predictor for task (i). Ensembling tailor-made models with zero-shot LLMs improved their performance for tasks (i) and (ii). For task (iii), LLMs outperformed or supplemented tailor-made models on out-of-distribution data. Without fine-tuning, current LLMs already can be useful in clinical genomic interpretation by adding complementary expertise to tailor-made, state-of-the-art predictors.
Miller, A. R.; Anderson, J. J.; Hernandez Gonzalez, M. E.; Rao Venkata, L. P.; Stonerock, E.; Mashburn-Warren, L.; Daley, A.; Leonard, J.; Pindrik, J.; Shaikhouni, A.; Boue, D. R.; Thomas, D. L.; Pierson, C. R.; Ostendorf, A. P.; Mardis, E. R.; Koboldt, D. C.; Miller, K. E.; Bedrosian, T. A.
Somatic variants are a prominent cause of epilepsy-associated cortical malformations, but about half of patients undergoing genetic testing have no finding due partly to limitations in variant detection. Most studies have focused on single-nucleotide variants or small indels that are accessible to short-read sequencing technologies, but somatic structural variants are also emerging as important contributors despite their unique detection challenges. Optical genome mapping (OGM) is a promising methodology for the detection of structural variants, but requires high quality, high molecular weight DNA from clinical specimens. Here we successfully optimize a protocol for OGM of surgically-resected patient brain tissue which yields ~450x effective coverage - suitable for detecting somatic variants at low allele fractions. We apply this approach to brain specimens from four patients with epilepsy. OGM identifies large and complex mosaic structural variants ranging from 7-40% variant allele fraction, most of which are not captured by short-read exome sequencing of the same specimen. In one patient with a known germline DEPDC5 variant, OGM reveals a somatic variant - a 13.2kb deletion in DEPDC5 at approximately 20% VAF - consistent with the established two-hit model in DEPDC5-associated lesional epilepsies. By resolving the breakpoints in PacBio HiFi sequencing data, we identify a mechanism for this somatic deletion, mediated by recombination of two Alu elements flanking the region. Our findings demonstrate that OGM is a robust and complementary tool for detecting somatic structural variation in human brain tissue, with potential to improve diagnostic yield and refine genotype-phenotype correlations in neurological disorders.
Holt, K. E.; Argimon, S.; Chaput, D. L.; Couto, N.; Dyson, Z. A.; Foster-Nyarko, E.; Goodman, R. N.; Hawkey, J.; Knight, G. M.; Nagy, D.; Prasad, A. B.; Sanchez-Buso, L.; Tsang, K. K.; Berends, M. S.
Microbial whole-genome sequence data is now generated at scale, including to support antimicrobial resistance (AMR) surveillance and understand resistance mechanisms, yet analytical infrastructure for systematically linking AMR genotypes to measured phenotypes remains fragmented. Here we present AMRgen, an R package to support systematic AMR genotype-phenotype analysis. AMRgen imports and harmonises genotypic data from common bioinformatics tools, alongside phenotypic data from automated antimicrobial susceptibility testing instruments and public repositories. It supports common analyses linking data to reference distributions, modelling associations, quantifying concordance, and producing publication-ready visualisations including UpSet plots that jointly display genotypic marker combination frequencies and associated phenotypic distributions. We demonstrate AMRgen's utility using publicly available surveillance data for World Health Organization priority AMR pathogens, Neisseria gonorrhoeae, Klebsiella pneumoniae, Escherichia coli and Salmonella enterica. AMRgen, available free and open-source at https://AMRgen.org, provides a reproducible end-to-end foundation for genotype-phenotype research in AMR genomics, clinical microbiology, and public health surveillance.
Pieczarka, M.; Pienkowski, P.; Konowalska, P.; Grubarek, S.; Hajto, J.; Hoinkis, D.; Piechota, M.; Borczyk, M.; Korostynski, M.
Pharmacogenetics (PGx) has traditionally focused on a small number of high-impact variants affecting drug response because PGx studies are labor-intensive and therefore low-throughput. Population biobanks linked to electronic health records (EHRs), including the UK Biobank (UKB) with prescription data for ~230,000 individuals, offer opportunities to scale PGx research. This, however, comes with a challenge, as EHRs do not provide direct treatment response outcomes. One way to overcome this is to draw indirect drug response phenotypes from prescription records. Here, we propose preSCRIPT, a framework to filter and annotate raw prescriptions from the UKB to derive phenotypes for analysis, which includes an algorithm to distinguish short prescription gaps from true dose changes. As a proof of concept, we applied preSCRIPT to warfarin, paracetamol, codeine, amitriptyline, simvastatin, aspirin, and amlodipine and derived therapy length and median daily doses. We tested associations for those seven drugs and two phenotypes across SNPs, cytochrome P450 (CYP) genes, and HLA alleles. We replicated known associations such as CYP2D6 variants with amitriptyline therapy length and dose, CYP2C9/CYP4F2/CYP2C19 with warfarin dose, and CYP2D6 with codeine dose. For drugs without formal PGx guidelines, we identified an association between CYP2D6 activity and aspirin therapy length and several SNPs, including rs62471929 (CYP3A5), a variant for amlodipine dose, replicated in an independent hold-out set. Overall, our study shows that preSCRIPT can recover established PGx associations, prioritize exploratory novel candidate loci, and may serve as a tool for large-scale pharmacogenomics.
Rong, Y.; Vysotskiy, M.; Chen, P. A.; Agrawal, E.; Marsh, E.; Wang, C. H.; Carr, D.; Dajani, R.; Gittens Maker, B.; Yu, F.; Goodman, D. B.; Shifrut, E.; Puck, J. M.; Marson, A.; Nguyen, D. N.
Distinguishing pathogenic from benign mutations is critical for genetic diagnosis. A CRISPR-targeted saturation genome editing (SGE) platform in primary human cells assessed 489 single nucleotide variants (SNVs) in exon 5 of IL2RG, the gene causing X-linked SCID. The functional impact was clearly defined for 470 variants, agreeing with 100% (18/18) of ClinVar-deposited benign or likely benign annotations, and 100% (42/42) of pathogenic or likely pathogenic annotations. We discovered 90 novel loss-of-function mutations and validated an expected block in T-lymphocyte differentiation from edited hematopoietic stem cells.
Cho, H.; Zhang, Y.; Zhou, J.; Daggar, A.; Kang, S.; Mannan, R.; Cao, X.; Dhanasekaran, S. M.; Chinnaiyan, A. M.
Single-cell RNA sequencing (scRNA-seq) effectively captures the differences in transcriptomic landscape of cell types and cell states between benign and cancer tissues. Pooling publicly available datasets distributed across independent studies enables increased sample representation and cross-study comparisons. Here we present a harmonized scRNA-seq atlas of the human prostate constructed by integrating 17 available studies, comprising 163 samples from 106 donors. The dataset contains benign tissue, primary tumors, and metastatic disease profiles. Raw sequencing FASTQ data files were uniformly reprocessed to minimize technical variability. Study metadata were curated and standardized using a unified schema capturing donor identity, tissue site, disease context, and histologic grade. Post quality control, the integrated dataset contains 754,000 high-quality cells. Harmonized cell type annotations were generated using a pseudobulk correlation framework informed by multiple reference resources. The workflow identified 17 distinct cell types representing epithelial, mesenchymal, and immune compartments of the prostate. The processed expression matrices, standardized metadata, and analysis workflows are publicly available to support reproducible analysis and enable exploration of heterogeneity across prostate disease states.
Chang, H.-C.; Shi, Y.; Cheng, H.; Zou, J.; Chang, A. C.-C.; Schlegel, B. T.; Wang, W.; Brown, D. D.; Chen, F.; Wang, S.; Li, D.; Sai, R.; Michel, N.; Oesterreich, S.; Lee, A. V.; Tseng, G. C.
Accurately inferring copy number variation (CNV) from scRNA-seq data is critical for identifying malignant cells, reconstructing tumor subclonal architecture, and uncovering the genomic drivers that dictate cancer cell biology. However, the performance of existing tools varies significantly, and current benchmarks lack the breadth of datasets and methods necessary to provide definitive guidance. We present a comprehensive benchmark of 12 CNV inference methods across 28 real datasets (>100,000 cells) and diverse synthetic datasets. By evaluating methods based on malignant cell classification accuracy, CNV inference accuracy, scalability, and robustness, we establish a definitive practitioner's guideline: allele-aware methods like Numbat excel when high-quality allelic inference can be achieved, whereas expression-centric tools such as Clonalscope, CopyKAT, inferCNV, and SCEVAN remain reliable when raw sequencing data are unavailable. Our study provides both a practical decision-making framework for researchers and a public repository of standardized CNV profiles to catalyze further methodological innovation.
Meng, M.; Liu, L.; Du, Q.; Zhou, X.; Tian, Y.; Sun, K.; Li, N.; Zhang, P.; Lian, X.; Fan, N.; Zhu, N.; Li, S.; Mao, A.; Li, Y.; Zou, G.
Background: Artificial intelligence (AI)-driven variant prioritization has demonstrated substantial utility in expediting genetic diagnosis by ranking the most likely causative variants. While a variety of tools have been developed, few address the unique clinical and technical constraints in prenatal genetic diagnosis. Methods: We introduce Berrylyzer, a novel, end-to-end variant prioritization system applied to prenatal diagnosis. Inspired by clinicians' reasoning process during variant interpretation, Berrylyzer applies a modular, stepwise scoring architecture that jointly integrates phenotypic and genomic evidence and delivers a ranked list of candidate variants, achieving high computational efficiency without compromising analytical rigor. Moreover, Berrylyzer natively supports both structured ontologies and free-text clinical narratives, enabling flexible integration into diverse clinical environments. Its performance was rigorously evaluated across two independent, real-world prenatal cohorts and benchmarked against three state-of-the-art methods: Xrare, Exomiser, and PhenIX. Results: Across the two datasets, Berrylyzer ranked 56.41% and 58.12% of diagnostic variants first, and achieved recall rates of 94.02% and 97.42% within the top 20, respectively. Berrylyzer outperformed Xrare (85.19% and 87.08%), Exomiser (84.90% and 85.98%), and PhenIX (82.05% and 88.93%). Stratified analysis consistently demonstrated superior performance across diverse disease categories, inheritance patterns, and analytical strategies. Notably, Berrylyzer exhibited robustness regardless of phenotype forms, yielding comparable top 20 recall rates for free-text descriptions and standardized terminologies. Conclusion: Berrylyzer represents an accurate, interpretable, and computationally lightweight variant prioritization system for prenatal genetic diagnosis. The superior performance across heterogeneous diagnostic contexts makes it a practical solution for seamless integration into clinical pipelines, thereby advancing precision medicine in prenatal settings.
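The "ranked first" and "top 20 recall" figures quoted in this abstract are top-k recall values: the share of cases whose diagnostic variant appears at rank k or better. A minimal sketch of that metric follows. The function name `top_k_recall` and the convention of using `None` for a diagnostic variant that was never ranked are assumptions for illustration, not details taken from the preprint.

```python
def top_k_recall(ranks, k):
    """Fraction of cases whose diagnostic variant is ranked within the top k.

    ranks: one entry per case, the 1-based rank of the diagnostic variant,
           or None if the tool did not rank it at all (counted as a miss).
    """
    if not ranks:
        raise ValueError("ranks must contain at least one case")
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


if __name__ == "__main__":
    # Toy ranks for five hypothetical cases (not data from the study).
    example_ranks = [1, 3, 25, None, 20]
    print("top-1 recall:", top_k_recall(example_ranks, 1))
    print("top-20 recall:", top_k_recall(example_ranks, 20))
```

Counting unranked diagnostic variants as misses keeps the denominator equal to the number of cases, so tools that filter out the causal variant are not rewarded for it.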
Ball, R. L.; Klein, A.; Gerring, M. W.; Berger-Liedtka, A. K.; Kim, M. J.; Berry, M. A.; Gargano, M. A.; Mukherjee, G.; Fisher, H. S.; Nichols-Meade, T.; Castellanos, F.; Smith, C. L.; Karlebach, G.; Murray, S. A.; Bult, C. J.; Robinson, P. N.; Chesler, E. J.
Choosing an appropriate mouse genetic background is a persistent challenge for successful translation of preclinical disease modeling. We present Strain Recommender, a genomic framework that prioritizes inbred mouse strains as relatively vulnerable or resilient to a disease state using disease-associated gene signatures and strain-specific transcriptome predictions. The method represents disease states as weighted gene scores, ranks 657 strains based on resemblance to the disease state, and estimates uncertainty via a permutation-derived false positive rate (FPR). In a prospective validation of connective tissue disorder predictions, vulnerable and resilient Collaborative Cross strains showed significantly different cardiovascular abnormalities. In a global retrospective validation predicting previously reported strain background effects, Strain Recommender achieved ≥90% sensitivity for 86.6% of diseases with 94.4% mean sensitivity (95% CI: 94.0-94.8%) across 5,890 diseases, including 92.3% (95% CI: 91.6-93.0%) for 2,598 rare diseases, demonstrating its potential to improve the validity of mouse models of human disease.
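The general idea of scoring strains against a weighted gene signature and attaching a permutation-derived false positive rate can be sketched as below. This is a generic illustration, not the Strain Recommender implementation: the weighted-sum score, the choice to permute signature weights across genes, the add-one correction, and all names (`signature_score`, `rank_strains`, `permutation_fpr`) are assumptions.

```python
import random


def signature_score(weights, expression):
    """Weighted sum of a strain's predicted expression over signature genes."""
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items())


def rank_strains(weights, strain_expression):
    """Return strain names sorted from most to least disease-like."""
    return sorted(
        strain_expression,
        key=lambda s: signature_score(weights, strain_expression[s]),
        reverse=True,
    )


def permutation_fpr(weights, expression, n_perm=1000, seed=0):
    """Share of weight permutations scoring at least as high as observed."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    observed = signature_score(weights, expression)
    genes = list(weights)
    values = [weights[g] for g in genes]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # break the gene-to-weight pairing
        if signature_score(dict(zip(genes, values)), expression) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy signature and strains (hypothetical values, not study data).
    weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
    strains = {
        "S1": {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0},
        "S2": {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0},
    }
    print(rank_strains(weights, strains))
    print(permutation_fpr(weights, strains["S1"]))
```

Permuting the weights rather than the strain labels tests whether the specific gene-to-weight pairing matters for a given strain; other null models are equally plausible and would give different estimates.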
Su, Y.; Lin, Y.-J.
Missense variant interpretation remains a central challenge in clinical genomics. Missense pathogenicity predictors achieve strong performance, but many emphasize protein-level consequences or overlapping annotation priors. Whether genomic language models add non-redundant nucleotide-context signal to missense interpretation remains unclear. Here, we systematically adapted genomic language models to ClinVar missense pathogenicity prediction across backbone architectures, representation strategies, classifier heads, and adaptation regimes. In our analysis, variant-position embeddings consistently outperformed pooled sequence representations, multi-species pretraining provided the strongest backbone-level advantage, and low-rank adaptation generalized better than full fine-tuning. The resulting fine-tuned model, GLM-Missense, substantially outperformed zero-shot scoring from the same pretrained model. To test whether GLM-Missense contributes information beyond existing methods, we built MetaMissense, an XGBoost ensemble combining GLM-Missense with AlphaMissense, ESM1b, REVEL, CADD, SIFT, and PolyPhen-2. GLM-Missense showed the lowest concordance with other predictors, retained the strongest partial association with pathogenicity after controlling for the other predictors, and ranked as the most informative non-ensemble input to MetaMissense. MetaMissense achieved the best performance in both cross-validation and held-out testing. Analyses of variants correctly classified by GLM-Missense but misclassified by several established predictors suggested two patterns. First, part of the GLM-Missense signal may reflect splice-relevant exonic context. Second, GLM-Missense appears to add value in settings where other predictors may overweight allele frequency, gene-level constraint, or amino-acid-change severity. However, these features explained only about 10% of the distinction between the GLM-Missense-correct subset and the background. Together, our results demonstrate that fine-tuned genomic language models contribute complementary nucleotide-context information to missense variant interpretation.
Abdelhakim, M.; Althagafi, A.; Schofield, P.; Hoehndorf, R.
Genotype-phenotype databases are essential for variant interpretation and disease gene discovery. Genetic variation differs among human populations, mainly in allele frequencies and haplotype patterns shaped by ancestry and demographic history. Population-specific genotypes can influence traits and disease risk; this makes population-specific characterization important. Most existing resources focus on the characterization of a population's genetic background, but do not represent the resulting phenotypes. We have developed PAVS (Phenotype-Associated Variants in Saudi Arabia), a curated, publicly accessible database that integrates 5,132 Saudi clinical cases from four Saudi cohorts and 522 cases from analysis of a mixed-population cohort, together with 1,856 cases from the Deciphering Developmental Disorders study (DDD) and 9,588 literature phenopackets. Each case record describes patient-level phenotypes, encoded with the Human Phenotype Ontology (HPO), and links them to genomic variants, gene identifiers, zygosity, pathogenicity classifications, and disease diagnoses mapped to standardized disease terminologies. The data is represented in Phenopackets format and as a knowledge graph in RDF. Additionally, a web interface provides phenotype-based similarity search, gene and variant browsers, and an HPO hierarchy explorer. We evaluate the utility of the phenotype annotations for gene prioritization using semantic similarity. While there are clear differences from global literature-curated databases, phenotypes in PAVS can successfully rank the correct gene highly (ROC AUC: 0.89). PAVS addresses a gap in population-specific genotype-phenotype resources and provides a benchmark for phenotype-driven variant prioritization in under-represented populations.